Results 1 - 7 of 7
1.
EClinicalMedicine; 70: 102479, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38685924

ABSTRACT

Background: Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods: Here, we propose a methodology to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes, one that is complementary to existing fairness metrics. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process that defines and quantifies domain-specific criteria, yielding the HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients aged 20 years or older, submitted by primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes than for others. The likelihood that AI performance was anticorrelated with pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., "R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes (presented as a percentage below).
Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and as years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of three dermatologists per case. Findings: Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance across racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0%, respectively, for prioritising AI performance across sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared with a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance across age subpopulations based on DALYs. Interpretation: Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. Funding: Google LLC.
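
The HEAL metric described above can be sketched in a few lines. Everything below is hypothetical (invented subgroup burdens and per-case 0/1 agreement outcomes), and the study's exact resampling scheme may differ; this is one minimal reading of "p(R > 0) estimated by bootstrap", with R oriented so that R > 0 means higher-burden subgroups rank higher in AI performance:

```python
import random
from statistics import mean

def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

def heal_metric(burden, cases, n_boot=1000, seed=7):
    """Estimate p(R > 0): the probability, over bootstrap resamples of
    each subgroup's cases, that subgroups with worse baseline outcomes
    (higher burden) co-rank with better AI performance."""
    rng = random.Random(seed)
    groups = sorted(burden)
    b = [burden[g] for g in groups]
    hits = 0
    for _ in range(n_boot):
        perf = [mean(rng.choices(cases[g], k=len(cases[g])))
                for g in groups]
        if spearman(b, perf) > 0:
            hits += 1
    return hits / n_boot

# Hypothetical subgroup burdens (e.g. YLLs) and 0/1 top-3 agreement
# outcomes per case; higher-burden groups here also perform better.
burden = {"A": 150.0, "B": 120.0, "C": 85.0, "D": 60.0}
cases = {"A": [1] * 92 + [0] * 8,  "B": [1] * 88 + [0] * 12,
         "C": [1] * 83 + [0] * 17, "D": [1] * 80 + [0] * 20}
print(heal_metric(burden, cases))
```

Because the invented data make burden and performance perfectly co-ranked, the bootstrap estimate lands near 1; with real data the resampling noise is what turns the rank comparison into a probability.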

2.
Front Artif Intell; 5: 828187, 2022.
Article in English | MEDLINE | ID: mdl-35664506

ABSTRACT

We propose a novel three-stage FIND-RESOLVE-LABEL workflow for crowdsourced annotation to reduce ambiguity in task instructions and, thus, improve annotation quality. Stage 1 (FIND) asks the crowd to find examples whose correct label seems ambiguous given task instructions. Workers are also asked to provide a short tag that describes the ambiguous concept embodied by the specific instance found. We compare collaborative vs. non-collaborative designs for this stage. In Stage 2 (RESOLVE), the requester selects one or more of these ambiguous examples to label (resolving ambiguity). The new label(s) are automatically injected back into task instructions in order to improve clarity. Finally, in Stage 3 (LABEL), workers perform the actual annotation using the revised guidelines with clarifying examples. We compare three designs using these examples: examples only, tags only, or both. We report image labeling experiments over six task designs using Amazon's Mechanical Turk. Results show improved annotation accuracy and further insights regarding effective design for crowdsourced annotation tasks.

3.
Lancet Digit Health; 4(4): e235-e244, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35272972

ABSTRACT

BACKGROUND: Diabetic retinopathy is a leading cause of preventable blindness, especially in low-income and middle-income countries (LMICs). Deep-learning systems have the potential to enhance diabetic retinopathy screenings in these settings, yet prospective studies assessing their usability and performance are scarce. METHODS: We did a prospective interventional cohort study to evaluate the real-world performance and feasibility of deploying a deep-learning system into the health-care system of Thailand. Patients with diabetes listed on the national diabetes registry, aged 18 years or older, able to have their fundus photograph taken for at least one eye, and due for screening as per the Thai Ministry of Public Health guidelines were eligible for inclusion. Eligible patients were screened with the deep-learning system at nine primary care sites under Thailand's national diabetic retinopathy screening programme. Patients with a previous diagnosis of diabetic macular oedema, severe non-proliferative diabetic retinopathy, or proliferative diabetic retinopathy; previous laser treatment of the retina or retinal surgery; other non-diabetic retinopathy eye disease requiring referral to an ophthalmologist; or inability to have a fundus photograph taken of both eyes for any reason were excluded. Deep-learning system-based interpretations of patient fundus images and referral recommendations were provided in real time. As a safety mechanism, regional retina specialists over-read each image. Performance of the deep-learning system (accuracy, sensitivity, specificity, positive predictive value [PPV], and negative predictive value [NPV]) was measured against an adjudicated reference standard, provided by fellowship-trained retina specialists. This study is registered with the Thai national clinical trials registry, TCRT20190902002. FINDINGS: Between Dec 12, 2018, and March 29, 2020, 7940 patients were screened for inclusion.
7651 (96·3%) patients were eligible for study analysis, and 2412 (31·5%) patients were referred for diabetic retinopathy, diabetic macular oedema, ungradable images, or low visual acuity. For vision-threatening diabetic retinopathy, the deep-learning system had an accuracy of 94·7% (95% CI 93·0-96·2), sensitivity of 91·4% (87·1-95·0), and specificity of 95·4% (94·1-96·7). The retina specialist over-readers had an accuracy of 93·5% (91·7-95·0; p=0·17), a sensitivity of 84·8% (79·4-90·0; p=0·024), and specificity of 95·5% (94·1-96·7; p=0·98). The PPV for the deep-learning system was 79·2% (95% CI 73·8-84·3) compared with 75·6% (69·8-81·1) for the over-readers. The NPV for the deep-learning system was 95·5% (92·8-97·9) compared with 92·4% (89·3-95·5) for the over-readers. INTERPRETATION: A deep-learning system can deliver real-time diabetic retinopathy detection capability similar to retina specialists in community-based screening settings. Socioenvironmental factors and workflows must be taken into consideration when implementing a deep-learning system within a large-scale screening programme in LMICs. FUNDING: Google and Rajavithi Hospital, Bangkok, Thailand. TRANSLATION: For the Thai translation of the abstract see Supplementary Materials section.
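
The five performance measures reported above are all simple ratios from one 2x2 table of screening calls against the adjudicated reference standard. A minimal sketch with invented counts (not the study's data):

```python
def screening_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, PPV and NPV from counts of
    true/false positives and negatives at a fixed referral threshold."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of true disease flagged
        "specificity": tn / (tn + fp),  # share of healthy correctly passed
        "ppv": tp / (tp + fp),          # how often a "refer" call is right
        "npv": tn / (tn + fn),          # how often a "no refer" call is right
    }

# Hypothetical counts for illustration only.
m = screening_metrics(tp=310, fp=90, tn=1700, fn=45)
print({k: round(v, 3) for k, v in m.items()})
```

Note that PPV and NPV, unlike sensitivity and specificity, shift with disease prevalence, which is why they are reported separately for the same operating point.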


Subject(s)
Deep Learning, Diabetes Mellitus, Diabetic Retinopathy, Macular Edema, Cohort Studies, Diabetic Retinopathy/diagnosis, Humans, Macular Edema/diagnosis, Prospective Studies, Thailand
4.
J Diabetes Res; 2020: 8839376, 2020.
Article in English | MEDLINE | ID: mdl-33381600

ABSTRACT

OBJECTIVE: To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. METHODS: We randomly selected patients with diabetes who were screened twice, two years apart, within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient's color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. RESULTS: There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p = 0.008; HG: from 74% to 57%, p < 0.001). At both the first and second screenings, the rate of false negatives for DL was about a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). CONCLUSION: On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up rounds in longitudinal DR screening can be more difficult and yield lower sensitivity for both DL and HG, although the false-negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
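
The simulated-referral step above (remove patients whose STDR a modality detects at the first screening, then re-evaluate that modality at the second screening) can be sketched as follows. The toy cohort and screening calls are invented, and the study's actual handling of ungradable images and per-eye grading is more involved:

```python
def followup(cohort):
    """cohort: (call1, stdr2, call2) triples, where call1 is the
    modality's visit-1 STDR call, stdr2 the reference STDR status at
    visit 2, and call2 the modality's visit-2 call. Patients flagged
    at visit 1 are referred for treatment and leave the cohort."""
    remaining = [p for p in cohort if not p[0]]
    cases = [p for p in remaining if p[1]]
    prevalence = len(cases) / len(remaining)
    sensitivity = sum(1 for p in cases if p[2]) / len(cases)
    return prevalence, sensitivity

cohort = [
    (True,  False, False),   # detected and referred at visit 1
    (True,  False, False),
    (False, True,  True),    # missed at visit 1, caught at visit 2
    (False, True,  False),   # missed twice: a visit-2 false negative
    (False, False, False),   # never develops STDR
] * 4
print(followup(cohort))  # visit-2 prevalence and sensitivity
```

A more sensitive modality removes more true cases at the first visit, which lowers second-visit prevalence and leaves a residual case mix that is harder to detect; that is the mechanism behind the sensitivity drop the abstract reports.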


Subject(s)
Deep Learning, Diabetic Retinopathy/diagnostic imaging, Fundus Oculi, Image Interpretation, Computer-Assisted, Macular Edema/diagnostic imaging, Mass Screening, Photography, Aged, Cell Proliferation, Diabetic Retinopathy/epidemiology, Female, Humans, Incidence, Longitudinal Studies, Macular Edema/epidemiology, Male, Middle Aged, National Health Programs, Predictive Value of Tests, Prevalence, Reproducibility of Results, Severity of Illness Index, Thailand/epidemiology
5.
Transl Vis Sci Technol; 8(6): 40, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31867141

ABSTRACT

PURPOSE: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. METHODS: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. RESULTS: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). CONCLUSIONS: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. TRANSLATIONAL RELEVANCE: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows.
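
Panel agreement in this study is scored with Cohen's quadratically weighted kappa, which penalises disagreements by the squared distance between ordinal grades rather than treating all disagreements equally. A minimal sketch over toy integer DR grades (the labels below are invented, not the study's):

```python
def quadratic_weighted_kappa(a, b, n_levels):
    """Cohen's kappa with quadratic weights: a disagreement between
    ordinal grades i and j costs ((i - j) / (n_levels - 1)) ** 2,
    normalised against chance agreement from the marginal histograms."""
    n = len(a)
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for i, j in zip(a, b):
        obs[i][j] += 1
    hist_a = [sum(row) for row in obs]                       # rater A marginals
    hist_b = [sum(row[j] for row in obs) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = ((i - j) / (n_levels - 1)) ** 2
            num += w * obs[i][j]                             # observed cost
            den += w * hist_a[i] * hist_b[j] / n             # expected cost
    return 1.0 - num / den

# Toy ordinal DR grades (0 = none ... 4 = proliferative) from two panels.
panel_1 = [0, 0, 1, 2, 4, 3, 2, 0, 1, 4]
panel_2 = [0, 1, 1, 2, 4, 3, 1, 0, 1, 4]
print(quadratic_weighted_kappa(panel_1, panel_2, n_levels=5))
```

Perfect agreement yields 1.0 and chance-level agreement yields 0.0; adjacent-grade slips are penalised far less than none-versus-proliferative confusions, which suits ordinal severity scales.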

6.
Ophthalmology; 126(12): 1627-1639, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31561879

ABSTRACT

PURPOSE: To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers. DESIGN: Development and validation of an algorithm. PARTICIPANTS: Fundus images from screening programs, studies, and a glaucoma clinic. METHODS: A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic. MAIN OUTCOME MEASURES: The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features. RESULTS: The algorithm's AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929-0.960) in dataset A, 0.855 (95% CI, 0.841-0.870) in dataset B, and 0.881 (95% CI, 0.838-0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. 
For both GSs and the algorithm, the features most relevant to referable GON were the presence of a vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, a retinal nerve fiber layer defect, and bared circumlinear vessels. CONCLUSIONS: A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than, and comparable specificity to, eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
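
The AUCs reported above have a direct probabilistic reading: the chance that a randomly chosen referable-GON image receives a higher algorithm score than a randomly chosen non-referable one, with ties counted as half. A brute-force sketch with made-up scores:

```python
def auc(pos_scores, neg_scores):
    """Empirical AUC: the fraction of (positive, negative) pairs in
    which the positive example scores higher, counting ties as 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for referable (pos) / non-referable (neg) images.
pos = [0.95, 0.88, 0.70, 0.62]
neg = [0.40, 0.55, 0.62, 0.10, 0.30]
print(auc(pos, neg))
```

This pairwise form is O(n * m) and fine for illustration; production code usually computes the same quantity from sorted ranks in O((n + m) log(n + m)).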


Subject(s)
Deep Learning, Glaucoma, Open-Angle/diagnosis, Ophthalmologists, Optic Disk/pathology, Optic Nerve Diseases/diagnosis, Specialization, Aged, Area Under Curve, Datasets as Topic, Female, Humans, Male, Middle Aged, Nerve Fibers/pathology, ROC Curve, Referral and Consultation, Retinal Ganglion Cells/pathology, Retrospective Studies, Sensitivity and Specificity
7.
Seizure; 71: 93-99, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31229939

ABSTRACT

PURPOSE: Children with epilepsy in low-income countries often go undiagnosed and untreated. We examine a portable, low-cost smartphone-based EEG technology in a heterogeneous pediatric epilepsy cohort in the West African Republic of Guinea. METHODS: Children with epilepsy were recruited at the Ignace Deen Hospital in Conakry in 2017. Participants underwent sequential EEG recordings with an app-based EEG, the Smartphone Brain Scanner-2 (SBS2), and a standard Xltek EEG. Raw EEG data were transmitted via Bluetooth™ connection to an Android™ tablet and uploaded for remote EEG specialist review and reporting via a new, secure web-based reading platform, crowdEEG. The results were compared with same-visit Xltek 10-20 EEG recordings for identification of epileptiform and non-epileptiform abnormalities. RESULTS: 97 children meeting the International League Against Epilepsy's definition of epilepsy (49 male; mean age 10.3 years; 29 untreated with an antiepileptic drug; 0 with a prior EEG) were enrolled. Epileptiform discharges were detected on 21 (25.3%) SBS2 and 31 (37.3%) standard EEG recordings. The SBS2 had a sensitivity of 51.6% (95% CI 32.4%, 70.8%) and a specificity of 90.4% (95% CI 81.4%, 94.4%) for all types of epileptiform discharges, with positive and negative predictive values of 76.2% and 75.8%, respectively. For generalized discharges, the SBS2 had a sensitivity of 43.5% with a specificity of 96.2%. CONCLUSIONS: The SBS2 has a moderate sensitivity and high specificity for the detection of epileptiform abnormalities in children with epilepsy in this low-income setting. Use of the SBS2+crowdEEG platform permits specialist input for patients with previously poor access to clinical neurophysiology expertise.


Subject(s)
Electroencephalography/standards, Epilepsy/diagnosis, Mobile Applications/standards, Smartphone/standards, Telemedicine/standards, Adolescent, Child, Child, Preschool, Electroencephalography/instrumentation, Female, Guinea, Humans, Infant, Male, Neurophysiological Monitoring, Sensitivity and Specificity, Telemedicine/instrumentation, Telemedicine/methods